This is the first of a series of two notebooks, in which we forecast sales for a retailer with 45 stores, each containing several departments, in multiple regions. We also want to understand the extent to which external factors, including holidays and promotions, affect sales in different departments, stores, and regions. In this notebook, we focus on exploring and manipulating our data, while the next one will be centered on the modeling itself. Understanding the data is important for analytics, and we recommend reading both notebooks to gain a better grasp of analytics for big and messy data. The three datasets we are using are already included (see the "Data" folder) in the repository where this Jupyter notebook is located, but you can also find them on this Kaggle page.
Among the most commonly used time series forecasting models are ARIMA models, which are what we are going to use in this notebook. However, there are other models you could use as well, such as ETS models. At the same time, since we want to understand the relationships between sales and several other variables, including markdowns, holidays, and potentially weather (temperature), it would be beneficial to first build a Simple or Multiple Linear Regression model, with sales as the response variable. Then we can build time series models to forecast the explanatory variables, such as temperature, and plug the time-series-predicted explanatory variables into the SLR/MLR models to generate forecasts for the response variable, sales. Models generated in this manner might be less accurate, as we go through multiple steps, each adding more risk of error; however, time series models combined with explanatory models are more interpretable and provide more managerial insights that can help you make recommendations and decisions. Let's go ahead and import the libraries we need and load our data:
# basic tools for data analysis and visualization in Python
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
# ipywidgets are used to make interactive content in Jupyter notebooks
from ipywidgets import interact
# display() renders rich output (e.g. dataframes) inside a cell
from IPython.display import display
df_sales = pd.read_csv('Data/sales.csv')
df_features = pd.read_csv('Data/features.csv')
df_stores = pd.read_csv('Data/stores.csv')
# converting the date columns (initially stored as strings) to dates
df_sales['Date'] = pd.to_datetime(df_sales['Date'], format='%d/%m/%Y')
df_features['Date'] = pd.to_datetime(df_features['Date'], format='%d/%m/%Y')
df_sales['Store'] = df_sales['Store'].astype('category')
df_sales['Dept'] = df_sales['Dept'].astype('category')
df_features['Store'] = df_features['Store'].astype('category')
df_stores['Store'] = df_stores['Store'].astype('category')
df_stores['Type'] = df_stores['Type'].astype('category')
# Information on the SettingWithCopywarning that might occur:
# https://www.dataquest.io/blog/settingwithcopywarning/
Before we build a model, it is always a good idea to take a look at the data itself. Some questions you might ask when exploring your data are:
After exploration, if there are lots of missing values, or if the distribution of a column is highly left or right skewed, which could affect the accuracy of your models, you will want to think about replacing or transforming those values before you build, compare, and select your models.
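For instance, the two checks just mentioned (missing values per column, and skewness of a column) can be run like this; the toy frame below is made up purely for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'sales': np.random.default_rng(0).lognormal(3, 1, 1000),  # right-skewed column
    'markdown': [np.nan] * 700 + list(range(300)),            # mostly missing column
})
print(toy.isnull().sum())             # missing values per column
print(round(toy['sales'].skew(), 2))  # positive skew => right skewed
```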
def sales_viz(store_num, dept_num):
    # plot the weekly sales time series and its histogram for one store/department
    fig, axes = plt.subplots(1, 2, figsize=(10, 3), dpi=100)
    sales_temp = df_sales[(df_sales.Store == store_num) & (df_sales.Dept == dept_num)][['Date', 'Weekly_Sales']]
    ts_sales_temp = sales_temp.set_index('Date')
    ts_sales_temp.plot(ax=axes[0], legend=False)
    ts_sales_temp.hist(ax=axes[1], bins=100)
    display(ts_sales_temp.describe().transpose().round(2))
    plt.show()
interact(sales_viz, store_num={n:n for n in range(1,46)}, dept_num={n:n for n in range(1,100)})
df_sales.info()
display(df_sales.describe().transpose().round(2))
print("CAUTION: {:.2%} of the `Weekly_Sales` column is negative."
      .format(len([n for n in df_sales['Weekly_Sales'] if n < 0]) /
              len(df_sales['Weekly_Sales'])))
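As an aside, the same share can be computed without a list comprehension: the mean of a boolean mask is the fraction of True values. On a toy series:

```python
import pandas as pd

s = pd.Series([5.0, -1.0, 3.0, -2.0])
neg_share = (s < 0).mean()  # mean of booleans = share of negative values
print("{:.2%}".format(neg_share))  # 50.00%
```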
fig, axes = plt.subplots(1,3, figsize=(20,4))
df_sales['Weekly_Sales'].hist(bins=100, ax=axes[0])
sns.boxplot(x='Store', y='Weekly_Sales', whis=100, data=df_sales, ax=axes[1])
sns.boxplot(x='Dept', y='Weekly_Sales', whis=100, data=df_sales, ax=axes[2])
plt.show()
display(df_sales.drop('Weekly_Sales', axis=1).describe().transpose())
fig, axes = plt.subplots(1,3, figsize=(20,4))
sns.countplot(x='IsHoliday', data=df_sales, ax=axes[0])
sns.countplot(x='Store', data=df_sales, ax=axes[1])
sns.countplot(x='Dept', data=df_sales, ax=axes[2])
plt.show()
99.68% of the Weekly_Sales column is positive. We could probably just drop the negative values so that we can take logarithms of the rest to make our data less skewed.
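To see why the log transform helps, here is a sketch on synthetic right-skewed (lognormal) data; the parameters are illustrative only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
pos_sales = pd.Series(rng.lognormal(mean=9, sigma=1, size=5000))  # positive, right skewed
log_sales = np.log(pos_sales)
# skewness drops from a large positive value to near zero after the log
print(round(pos_sales.skew(), 2), round(log_sales.skew(), 2))
```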
Now that we have taken a look at the aggregated data, we want to look at the sales of each store and department at an individual level. However, it is not efficient to print or plot the subsets one by one. Luckily, the interact() function from the ipywidgets package allows us to create an interactive "dashboard" in Jupyter notebooks:
We can see that:

- In the sales dataframe, sales across multiple stores with several departments are recorded on each day, even though we do not know whether all stores or departments are open on each of the 143 days.
- The weekly sales column is extremely right skewed, with a mean of \$15,981 but a median of only \$7,612. The range of the column is also fairly large, from -\$4,989 to \$693,099.
- The distributions of weekly sales records are extremely right skewed (maybe most of the sales happen on holidays).

Features

Now let's move on to the features dataframe:
df_features.info()
display(df_features[df_features['CPI'].isnull()])
display(df_features[df_features['Unemployment'].isnull()])
display(df_features.describe().round(2))
for i in range(5):
    print('CAUTION: {:.2%} of MarkDown{} (non-null) is negative.'
          .format(len([n for n in df_features['MarkDown'+str(i+1)] if n < 0]) /
                  len(df_features['MarkDown'+str(i+1)].dropna()), i+1))
df_features.drop(['Store', 'Date', 'IsHoliday'], axis=1).hist(figsize=(15,10), bins=100)
plt.show()
display(df_features[['Store', 'Date', 'IsHoliday']].describe().transpose())
fig, axes = plt.subplots(1,2, figsize=(10,3))
sns.countplot(x='IsHoliday', data=df_features, ax=axes[0])
sns.countplot(x='Store', data=df_features, ax=axes[1])
plt.show()
Right away, we can see that:

- Compared to the sales dataframe, there are fewer records (8,190) in the features dataframe. This makes sense because this dataframe is only detailed to the store level, not the department level; for example, different departments in the same store probably experience the same weather on the same day.
- The same proportion of features records are holidays, which is consistent with the sales dataframe.

Stores

Last but not least, let's take a look at the stores dataframe:
df_stores.info()
display(pd.DataFrame(df_stores.drop('Store', axis=1).describe(include='all')).transpose().round(2))
fig, axes = plt.subplots(1,3, figsize=(15,3))
sns.stripplot(x='Type', y='Size', data=df_stores, ax=axes[0])
sns.countplot(x='Type', data=df_stores, ax=axes[1])
sns.histplot(df_stores['Size'], bins=15, ax=axes[2])
plt.show()
The stores dataframe appears to be smaller and simpler. We can see that:

- We might want to explore the relationship between store type and the other variables in the three dataframes.
- Sizes of the stores are not extremely skewed, with the mean close to the 50th percentile. However, we might want to check whether they are light or heavy tailed.

At a glance (by randomly inspecting different combinations), seasonalities and trends vary across stores and departments. Some stores/departments have consistent sales over time, while others see huge spikes either in the middle or at the end of each year. There are also increasing or decreasing trends in certain stores/departments.
Even though it is nice to be able to see detailed sales performance for each individual department in every store, we also want to inspect whether there are shared patterns in our sales data. Thus, let's aggregate the data a little so that we can visualize and compare sales across different stores for the same department, and then across different departments in the same store.
fig, ax = plt.subplots(9, 5, sharex='col', sharey='row', figsize=(15, 20), dpi=100)
store_num = pd.Series(range(1, 46), dtype='category')
store_index = 0
for i in range(9):
    for j in range(5):
        # one subplot per store, with one line per department
        sales_temp = df_sales[df_sales.Store == store_num[store_index]][['Date', 'Dept', 'Weekly_Sales']]
        sales_temp['Dept'] = sales_temp['Dept'].astype('category')
        sns.lineplot(x='Date', y='Weekly_Sales', hue='Dept', data=sales_temp, legend=False, ax=ax[i, j])
        store_index += 1
fig, ax = plt.subplots(20, 5, sharex='col', sharey='row', figsize=(15, 40), dpi=100)
dept_num = pd.Series(range(1, 101), dtype='category')
dept_index = 0
for i in range(20):
    for j in range(5):
        # one subplot per department, with one line per store
        sales_temp = df_sales[df_sales.Dept == dept_num[dept_index]][['Date', 'Store', 'Weekly_Sales']]
        sales_temp['Store'] = sales_temp['Store'].astype('category')
        sns.lineplot(x='Date', y='Weekly_Sales', hue='Store', data=sales_temp, legend=False, ax=ax[i, j])
        dept_index += 1
df_all = df_sales.merge(df_features,
on=['Store', 'Date', 'IsHoliday'],
how='left').merge(df_stores,
on='Store',
how='left')
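Since we merge with how='left', every row of the sales dataframe is kept, and the store-level features are repeated for each department of a store. A toy illustration of this left-merge behavior (the frames and numbers here are made up):

```python
import pandas as pd

left = pd.DataFrame({'Store': [1, 1, 2], 'Dept': [1, 2, 1],
                     'Weekly_Sales': [100.0, 50.0, 80.0]})
right = pd.DataFrame({'Store': [1, 2], 'Size': [150000, 90000]})

# how='left' keeps every row of `left`; store-level columns are broadcast
merged = left.merge(right, on='Store', how='left')
print(merged)
```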
df_all.info()
mon_val = ['Weekly_Sales', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']
df_adj = pd.concat([df_all.drop(mon_val, axis=1),
(df_all[mon_val]*100).div(df_all.CPI, axis=0).add_prefix('AdjCPI_')], axis=1)
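The adjustment above deflates each monetary column: real value = nominal value × 100 / CPI. A toy check of the same pattern (made-up numbers):

```python
import pandas as pd

df = pd.DataFrame({'Weekly_Sales': [1000.0, 2000.0], 'CPI': [125.0, 200.0]})
# same pattern as above: nominal * 100 / CPI, with an AdjCPI_ prefix
adj = (df[['Weekly_Sales']] * 100).div(df.CPI, axis=0).add_prefix('AdjCPI_')
print(adj)  # 800.0 and 1000.0 in base-period dollars
```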
adj_val = ['AdjCPI_' + s for s in mon_val if s != 'Fuel_Price']
display(df_adj[adj_val].describe().transpose())
df_adj[adj_val].hist(bins=100, figsize=(13,10))
plt.show()
df_pos = df_adj[((df_adj[adj_val]>0)|(df_adj[adj_val].isnull())).all(1)]
print(df_adj.shape)
print(df_pos.shape)
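The filter above keeps a row only if every adjusted column is either positive or missing, so that logarithms can be taken later while rows with missing markdowns still survive. A toy illustration of the mask (made-up frame):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, -1.0, 2.0, np.nan],
                    'b': [1.0, 1.0, -3.0, 5.0]})
# keep rows where every value is positive OR missing
mask = ((toy > 0) | (toy.isnull())).all(axis=1)
print(toy[mask])  # rows 0 and 3 survive
```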
df_trans = pd.concat([df_pos.drop(adj_val, axis=1),
                      np.log(df_pos[adj_val]).add_prefix('TransLog_')], axis=1)
df_trans.info()
trans_val = ['TransLog_' + s for s in adj_val]
display(df_trans[trans_val].describe().transpose())
df_trans[trans_val].hist(bins=100, figsize=(13,10))
plt.show()
df_full = pd.concat([df_trans.drop(trans_val, axis=1), df_trans[trans_val].fillna(0.0)], axis=1)
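Note that fillna(0.0) on the log-transformed columns implicitly treats a missing markdown as exp(0) = 1 adjusted dollar, i.e. essentially no markdown; whether that is acceptable depends on the model. A toy illustration:

```python
import numpy as np
import pandas as pd

logs = pd.Series([np.log(50.0), np.nan, np.log(200.0)])
filled = logs.fillna(0.0)  # 0 in log space corresponds to exp(0) = 1 in original units
print(filled.round(3))
```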
df_full.info()
sns.pairplot(df_full.drop(['Store', 'Dept', 'Date', 'IsHoliday', 'Size',
'CPI', 'AdjCPI_Fuel_Price', 'Unemployment'], axis=1), hue='Type')